suppressPackageStartupMessages(library(tidyverse))
Case study: how do features of nesting female horseshoe crabs influence the number of males found nearby?
Load the data. Here are the first six of the 173 rows:
crab <- read_table("https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt", col_names = FALSE) %>%
  select(-1) %>%
  setNames(c("colour", "spine", "width", "weight", "n_male")) %>%
  mutate(colour = factor(colour),
         spine  = factor(spine))
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## X2 = col_integer(),
## X3 = col_integer(),
## X4 = col_double(),
## X5 = col_double(),
## X6 = col_integer()
## )
knitr::kable(head(crab))
| colour | spine | width | weight | n_male |
|---|---|---|---|---|
| 2 | 3 | 28.3 | 3.05 | 8 |
| 3 | 3 | 26.0 | 2.60 | 4 |
| 3 | 3 | 25.6 | 2.15 | 0 |
| 4 | 2 | 21.0 | 1.85 | 0 |
| 2 | 3 | 29.0 | 3.00 | 1 |
| 1 | 2 | 25.0 | 2.30 | 3 |
Predictors: colour, spine condition, carapace width, and weight.
First, let’s see how carapace width influences the mean number of males nearby.
p <- ggplot(crab, aes(width, n_male)) +
  geom_point(alpha = 0.25) +
  labs(x = "Carapace Width",
       y = "No. males\nnearby") +
  theme_bw() +
  theme(axis.title.y = element_text(angle = 0, vjust = 0.5))
plotly::ggplotly(p)
Data source: H. Jane Brockmann’s 1996 paper. The data are available online, as is another regression demo that uses them.
These questions are meant to check your understanding of local regression.
What is the estimated mean number of nearby males for nesting females with a carapace width of 32.5? Compute it by hand using each of the following methods.
1. kNN with \(k=3\).
2. A moving window with a radius of 2.4.
3. A kernel smoother with a Gaussian kernel with variance 1.
4. Local polynomials with a radius of 2.4 and a flat kernel, first with degree 1, then with degree 2.
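To check hand calculations, the four estimators can be sketched as small base R functions. This is a minimal sketch on made-up toy data, not the crab data: the vectors `x` and `y`, the query point `x0`, the radius of 1.5, and the function names `knn_mean`, `window_mean`, `kernel_mean`, and `local_poly` are all hypothetical and chosen for illustration only.

```r
# Made-up toy data (NOT the crab data): x plays the role of width, y of n_male.
x  <- c(1, 2, 3, 4, 5, 6, 7)
y  <- c(0, 2, 1, 4, 3, 6, 5)
x0 <- 4.2   # hypothetical query point

# 1. kNN: average the responses of the k points closest to x0.
knn_mean <- function(x, y, x0, k) {
  nearest <- order(abs(x - x0))[seq_len(k)]
  mean(y[nearest])
}

# 2. Moving window: average the responses of all points within radius r of x0.
window_mean <- function(x, y, x0, r) {
  mean(y[abs(x - x0) <= r])
}

# 3. Kernel smoother: weighted average with Gaussian weights centred at x0.
#    dnorm() takes a standard deviation, so convert from the variance.
kernel_mean <- function(x, y, x0, variance = 1) {
  w <- dnorm(x - x0, sd = sqrt(variance))
  sum(w * y) / sum(w)
}

# 4. Local polynomial with a flat kernel: ordinary least squares on the
#    points inside the window, then evaluate the fitted polynomial at x0.
local_poly <- function(x, y, x0, r, degree) {
  inside <- abs(x - x0) <= r
  X <- outer(x[inside], 0:degree, "^")        # columns: 1, x, x^2, ...
  beta <- lm.fit(X, y[inside])$coefficients
  sum(beta * x0^(0:degree))
}

knn_mean(x, y, x0, k = 3)
window_mean(x, y, x0, r = 1.5)
kernel_mean(x, y, x0, variance = 1)
local_poly(x, y, x0, r = 1.5, degree = 1)
local_poly(x, y, x0, r = 1.5, degree = 2)
```

On the crab data, the same functions would be called with `crab$width`, `crab$n_male`, `x0 = 32.5`, and the values of `k`, radius, and variance given in the questions. Note that the kNN sketch breaks distance ties by data order, which is one of several reasonable conventions.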
Optimize the loess fit by eye. To keep things simple, modify only the span.
grid <- seq(min(crab$width), max(crab$width), length.out=100)
grid_df <- tibble(width = grid)
# FIT_MODEL_HERE
# PLOT_CURVE_HERE
What’s the error of this model? Training error is fine.
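One possible way to fit a loess curve, evaluate it on a grid, and compute a training error is sketched below. To stay runnable without downloading anything, it uses a simulated stand-in for the crab data: the data frame `sim`, its generating coefficients, and the choice `span = 0.5` are assumptions made for illustration, and mean squared error is only one of several reasonable error measures.

```r
# Simulated stand-in for the crab data (hypothetical values, not the real data).
set.seed(1)
sim <- data.frame(width = runif(150, 21, 34))
sim$n_male <- rpois(150, lambda = exp(-3 + 0.15 * sim$width))

# Fit loess. `span` is the fraction of observations used in each local fit:
# larger values give a smoother curve. The value 0.5 is an arbitrary choice.
fit <- loess(n_male ~ width, data = sim, span = 0.5)

# Evaluate the fitted curve on a grid spanning the observed widths.
grid_df <- data.frame(
  width = seq(min(sim$width), max(sim$width), length.out = 100)
)
grid_df$fitted <- predict(fit, newdata = grid_df)

# Training error: mean squared error on the data used to fit the model.
mse_train <- mean((sim$n_male - predict(fit, newdata = sim))^2)
mse_train

# Quick visual check of the curve against the data.
plot(sim$width, sim$n_male, xlab = "Carapace Width", ylab = "No. males nearby")
lines(grid_df$width, grid_df$fitted)
```

Rerunning the fit with a few different `span` values and redrawing the curve is a simple way to tune it by eye. Keep in mind that training error always favours smaller spans, so it should not be the only guide.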
How well does this model answer our original question?
Fit a linear regression model. What’s the error?
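A linear fit and its training error might be computed along the following lines. As before, `sim` is a simulated stand-in rather than the crab data, its generating coefficients are made up, and mean squared error is an assumed choice of error measure.

```r
# Simulated stand-in for the crab data (hypothetical values, not the real data).
set.seed(1)
sim <- data.frame(width = runif(150, 21, 34))
sim$n_male <- rpois(150, lambda = exp(-3 + 0.15 * sim$width))

# Ordinary least squares: the mean of n_male is modelled as linear in width.
lm_fit <- lm(n_male ~ width, data = sim)
coef(lm_fit)

# Training error: mean squared residual.
mse_lm <- mean(residuals(lm_fit)^2)
mse_lm
```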
How well does this model answer our original question? Do you see a potential problem with this model? Are any assumptions of linear regression violated? Brainstorm ideas for how to deal with the problems.
Fit a GLM. What’s the error?
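The question does not say which GLM to use. Since the response is a count, the sketch below assumes a Poisson family with a log link, which is a common choice but not the only one. The data frame `sim` is again a simulated stand-in with made-up coefficients, not the crab data.

```r
# Simulated stand-in for the crab data (hypothetical values, not the real data).
set.seed(1)
sim <- data.frame(width = runif(150, 21, 34))
sim$n_male <- rpois(150, lambda = exp(-3 + 0.15 * sim$width))

# Poisson regression with a log link (an assumed family, see the note above):
# log(mean of n_male) is modelled as linear in width.
glm_fit <- glm(n_male ~ width, family = poisson(link = "log"), data = sim)
coef(glm_fit)

# Predicted means on the response scale. The log link keeps them positive.
mu_hat <- predict(glm_fit, type = "response")

# Training error: mean squared error, comparable with the other models.
mse_glm <- mean((sim$n_male - mu_hat)^2)
mse_glm
```

Using `type = "response"` matters here: the default for `predict()` on a GLM is the link scale, which would give predicted log-means rather than predicted counts.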